Demonstrating 100 Gbps in and out of the public Clouds
There is increased awareness and recognition that public Cloud providers do
provide capabilities not found elsewhere, with elasticity being a major driver.
The value of elastic scaling is however tightly coupled to the capabilities of
the networks that connect all involved resources, both in the public Clouds and
at the various research institutions. This paper presents results of
measurements involving file transfers inside public Cloud providers, fetching
data from on-prem resources into public Cloud instances and fetching data from
public Cloud storage into on-prem nodes. The networking of the three major
Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google
Cloud Platform, has been benchmarked. The on-prem nodes were managed by either
the Pacific Research Platform or located at the University of Wisconsin -
Madison. The observed sustained throughput was of the order of 100 Gbps in all
the tests moving data in and out of the public Clouds and throughput reaching
into the Tbps range for data movements inside the public Cloud providers
themselves. All the tests used HTTP as the transfer protocol. Comment: 4 pages, 6 figures, 3 tables
Defining a canonical unit for accounting purposes
Compute resource providers often put in place batch compute systems to
maximize the utilization of such resources. However, compute nodes in such
clusters, both physical and logical, contain several complementary resources,
with notable examples being CPUs, GPUs, memory and ephemeral storage. User jobs
will typically require more than one such resource, resulting in co-scheduling
trade-offs of partial nodes, especially in multi-user environments. When
accounting for either user billing or scheduling overhead, it is thus important
to consider all such resources together. We thus define the concept of a
threshold-based "canonical unit" that combines several resource types into a
single discrete unit and use it to characterize scheduling overhead and make
resource billing more fair for both resource providers and users. Note that the
exact definition of a canonical unit is not prescribed and may change between
resource providers. Nevertheless, we provide a template and two example
definitions that we consider appropriate in the context of the Open Science
Grid. Comment: 6 pages, 2 figures, To be published in proceedings of PEARC2
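A threshold-based unit of this kind can be sketched in a few lines of Python. This is only one possible reading of the abstract, not the paper's definition: the `CanonicalUnit` class, the `canonical_units` function and the threshold values (1 CPU core and 4 GB of memory per unit) are hypothetical and chosen purely for illustration. The idea shown is that a job is charged the smallest whole number of units whose thresholds cover every resource it requests.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalUnit:
    """Per-unit resource thresholds (illustrative values, not from the paper)."""
    cpus: int
    memory_gb: int


def canonical_units(unit: CanonicalUnit, cpus: int, memory_gb: int) -> int:
    """Return the smallest whole number of units covering all requested resources.

    Each resource is divided by its per-unit threshold and rounded up;
    the most demanding resource determines the charge.
    """
    needed_for_cpus = math.ceil(cpus / unit.cpus)
    needed_for_memory = math.ceil(memory_gb / unit.memory_gb)
    # A job always occupies at least one unit, even if it requests very little.
    return max(1, needed_for_cpus, needed_for_memory)


# Hypothetical unit: 1 CPU core plus 4 GB of memory.
example_unit = CanonicalUnit(cpus=1, memory_gb=4)

# A memory-heavy job: 2 cores but 16 GB, so memory drives the charge.
memory_heavy = canonical_units(example_unit, cpus=2, memory_gb=16)

# A CPU-heavy job: 8 cores and only 8 GB, so CPU drives the charge.
cpu_heavy = canonical_units(example_unit, cpus=8, memory_gb=8)
```

Under these assumed thresholds the memory-heavy job is charged for the cores it blocks other users from pairing with memory, which is the kind of partial-node trade-off the abstract describes.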
glideinWMS - A generic pilot-based Workload Management System
The Grid resources are distributed among hundreds of independent Grid sites, requiring a higher level Workload Management System (WMS) to be used efficiently. Pilot jobs have been used for this purpose by many communities, bringing increased reliability, global fair share and just-in-time resource matching. GlideinWMS is a WMS based on the Condor glidein concept, i.e. a regular Condor pool, with the Condor daemons (startds) being started by pilot jobs, and real jobs being vanilla, standard or MPI universe jobs. The glideinWMS is composed of a set of Glidein Factories, handling the submission of pilot jobs to a set of Grid sites, and a set of VO Frontends, requesting pilot submission based on the status of user jobs. This paper contains the structural overview of glideinWMS as well as a detailed description of the current implementation and the current scalability limits.
Porting and optimizing UniFrac for GPUs
UniFrac is a commonly used metric in microbiome research for comparing
microbiome profiles to one another ("beta diversity"). The recently implemented
Striped UniFrac added the capability to split the problem into many independent
subproblems and exhibits near linear scaling. In this paper we describe steps
undertaken in porting and optimizing Striped UniFrac to GPUs. We reduced the
run time of computing UniFrac on the published Earth Microbiome Project dataset
from 13 hours on an Intel Xeon E5-2680 v4 CPU to 12 minutes on an NVIDIA Tesla
V100 GPU, and to about one hour on a laptop with NVIDIA GTX 1050 (with minor
loss in precision). Computing UniFrac on a larger dataset containing 113k
samples reduced the run time from over one month on the CPU to less than 2
hours on the V100 and 9 hours on an NVIDIA RTX 2080TI GPU (with minor loss in
precision). This was achieved by using OpenACC for generating the GPU offload
code and by improving the memory access patterns. A BSD-licensed implementation
is available, which produces a C shared library linkable by any programming
language. Comment: 4 pages, 3 figures, 4 tables
Characterizing network paths in and out of the clouds
Commercial Cloud computing is becoming mainstream, with funding agencies
moving beyond prototyping and starting to fund production campaigns, too. An
important aspect of any scientific computing production campaign is data
movement, both incoming and outgoing. And while the performance and cost of VMs
are relatively well understood, the network performance and cost are not. This
paper provides a characterization of networking in various regions of Amazon
Web Services, Microsoft Azure and Google Cloud Platform, both between Cloud
resources and major DTNs in the Pacific Research Platform, including OSG data
federation caches in the network backbone, and inside the clouds themselves.
The paper contains both a qualitative analysis of the results as well as
latency and throughput measurements. It also includes an analysis of the costs
involved with Cloud-based networking. Comment: 7 pages, 1 figure, 5 tables, to be published in CHEP19 proceedings
Running a Pre-Exascale, Geographically Distributed, Multi-Cloud Scientific Simulation
As we approach the Exascale era, it is important to verify that the existing
frameworks and tools will still work at that scale. Moreover, public Cloud
computing has been emerging as a viable solution for both prototyping and
urgent computing. Using the elasticity of the Cloud, we have thus put in place
a pre-exascale HTCondor setup for running a scientific simulation in the Cloud,
with the chosen application being IceCube's photon propagation simulation. That is,
this was not purely a demonstration run; it was also used to produce
valuable and much needed scientific results for the IceCube collaboration. In
order to reach the desired scale, we aggregated GPU resources across 8 GPU
models from many geographic regions across Amazon Web Services, Microsoft
Azure, and the Google Cloud Platform. Using this setup, we reached a peak of
over 51k GPUs corresponding to almost 380 PFLOP32s, for a total integrated
compute of about 100k GPU hours. In this paper we provide the description of
the setup, the problems that were discovered and overcome, as well as a short
description of the actual science output of the exercise. Comment: 18 pages, 5 figures, 4 tables, to be published in Proceedings of ISC
High Performance 202
Testing GitHub projects on custom resources using unprivileged Kubernetes runners
GitHub is a popular repository for hosting software projects, both due to
ease of use and the seamless integration with its testing environment. Native
GitHub Actions make it easy for software developers to validate new commits and
have confidence that new code does not introduce major bugs. The freely
available test environments are limited to only a few popular setups but can be
extended with custom Action Runners. Our team had access to a Kubernetes
cluster with GPU accelerators, so we explored the feasibility of automatically
deploying GPU-providing runners there. All available Kubernetes-based setups,
however, require cluster-admin level privileges. To address this problem, we
developed a simple custom setup that operates in a completely unprivileged
manner. In this paper we provide a summary description of the setup and our
experience using it in the context of two Knight lab projects on the Prototype
National Research Platform system. Comment: 5 pages, 1 figure, To be published in proceedings of PEARC2
Microarchitecture: A useful tool to organize machines in heterogeneous shared computing environments
The x86_64 instruction set architecture is not a single, consistent, compatible interface to execute computer programs. Since the initial release in 1999, every new generation has added new instructions, some of which were later removed. Most of these new instructions are intended to improve the performance of those programs which explicitly take advantage of them. However, running such a program on an older CPU without the appropriate support results in a Linux SIGILL signal, which is difficult for end users to diagnose. On the other hand, compiling scientific code for the least common denominator ISA can leave significant performance on the table. High Throughput systems, containing a very large number of machines, cannot require a single CPU version across hundreds of thousands of machines operating at dozens of sites. The OSG Open Science Pool alone consists of more than 20 different, subtly incompatible x86_64 implementations. In 2020, Intel, AMD and RedHat proposed new terminology and partitioned these dozens of microarchitectures into a strict hierarchy of four groups. The HTCondor Software Suite and the OSG now have first-class support for these microarchitectures. This paper discusses the advantages for users and future work around microarchitecture support.
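The strict hierarchy described above can be illustrated with a short Python sketch that maps a set of CPU feature flags to a level. This is not how HTCondor or the OSG implement it. The level names (`x86-64-v1` to `x86-64-v4`) and the per-level flag lists are my reading of the x86-64 psABI levels, spelled as in the Linux `/proc/cpuinfo` "flags" line, and should be checked against the specification before use. The function names are hypothetical.

```python
# Flags required by each level beyond the previous one, lowest level first.
# Assumption: names follow the Linux /proc/cpuinfo spelling
# (for example "pni" for SSE3 and "abm" for LZCNT).
LEVEL_FLAGS = [
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "pni",
                   "sse4_1", "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c",
                   "fma", "abm", "movbe", "xsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd",
                   "avx512dq", "avx512vl"}),
]


def microarch_level(flags: set) -> str:
    """Return the highest level whose requirements, and all lower ones, are met."""
    level = "x86-64-v1"  # baseline assumed for any x86_64 CPU
    for name, required in LEVEL_FLAGS:
        if not required <= flags:
            # Strict hierarchy: a missing lower level rules out all higher ones.
            break
        level = name
    return level


def read_cpu_flags(path: str = "/proc/cpuinfo") -> set:
    """Read the feature flags of the first CPU listed (Linux only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On a Linux machine, `microarch_level(read_cpu_flags())` gives the level of the local CPU. A job built for a given level could then be matched only to machines reporting that level or higher, which avoids the SIGILL failure mode without falling back to the lowest common denominator.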
An objective comparison test of workload management
Grid resources are distributed among hundreds of independent Grid sites, requiring a higher level Workload Management System (WMS) to be used efficiently. There are several ways to design and implement a WMS, and indeed in recent years several WMSes have been developed. The purpose of this paper is to show how some of these different WMSes behave under realistic load conditions. We present benchmark test results for three general-purpose WMSes, namely ReSS, gLite WMS and glideinWMS. The results presented have been measured using the same tools for all the tested WMSes, comparing those results against a baseline obtained by using plain Condor-G submissions.